Deep Learning and Music Adversaries

Corey Kereliuk, Member, IEEE, Bob L. Sturm, Member, IEEE, Jan Larsen, Senior Member, IEEE


Abstract —An adversary is essentially an algorithm intent on 
making a classification system perform in some particular way 
given an input, e.g., increase the probability of a false negative.
Recent work builds adversaries for deep learning systems applied 
to image object recognition, which exploits the parameters of 
the system to find the minimal perturbation of the input image 
such that the network misclassifies it with high confidence. We 
adapt this approach to construct and deploy an adversary of 
deep learning systems applied to music content analysis. In our 
case, however, the input to the systems is magnitude spectral 
frames, which requires special care in order to produce valid 
input audio signals from network-derived perturbations. For two 
different train-test partitionings of two benchmark datasets, and 
two different deep architectures, we find that this adversary is 
very effective in defeating the resulting systems. We find the 
convolutional networks are more robust, however, compared with 
systems based on a majority vote over individually classified 
audio frames. Furthermore, we integrate the adversary into the 
training of new deep systems, but do not find that this improves 
their resilience against the same adversary. 


I. Introduction 

Deep learning is impacting the research domain of music
content analysis and music information retrieval (MIR)
JT9), (28), (M), (^, (Mj, (44), (S^, (6^, (65), but recent 
developments raise the spectre that the high performance of 
these systems does not reflect how well they have learned to 
solve high-level problems of music listening. MIR aims to 
produce systems that help make “music, or information about 
music, easier to find” m This is of principal importance 
for confronting the vast amount of music data that exists and 
continues to be created. Listening machines that can flexibly 
produce accurate, meaningful and searchable descriptions of 
music can greatly reduce the cost of processing music data, 
and can facilitate a diversity of applications. These extend 
from music identification (59) , author attribution | fT3| , recom¬ 
mendation ( 53 , transcription (^, and playlist generation ( 3 , 
to extracting semantic descriptors such as genre and mood 
0, ( 43 , (g, to computational musicology (45) , and even 
synthesis and music composition (43) . 

Recent surveys of the domain of deep learning record 
impressive results for several benchmark problems ( 3 , (TT) . In 
addition to these major successes, deep learning methods are 
very attractive for three other reasons: there now exist efficient
and effective training algorithms for deep learning, not to mention
completely free and open cross-platform implementations,
e.g., Theano; they entail jointly optimising feature
learning and classification, thus allowing one to forgo many
difficulties inherent to formally encoding expert knowledge
into a machine; and their layered structure seems to favour
hierarchical representations of structures in data. One caveat,
however, is that these methods require a lot of data in order 
to estimate parameters and generalise well HD- 

In MIR, the works in (28) , (34) , | |T5) are among the first 
to apply deep learning to music content analysis, and each 
describes results pointing to the conclusion that these systems 
can automatically learn features relevant for complex music 
listening tasks, e.g., recognition of genre or style. Results 
since then point to the same conclusion (ID, gg, (44), (63, 
| |65) - Humphrey et al. m highlight this fact to argue deep 
learning is naturally suited to learn relevant abstractions for 
music content analysis, provided enough data is available. 
Since music can be seen as a “whole greater than the sum 
of its parts” (31) , deep learning can help MIR narrow the 
“semantic gap” (62), and move beyond what has been called 
a “glass ceiling” in performance 0 - 

However, it is now known how deceiving the appearance 
of high performance can be: an MIR system can appear to 
be very successful in solving a high-level music listening 
problem when in fact it is just exploiting some independent 
variables of questionable relevance unknowingly confounded 
with the ground truth of a music dataset by a poor experimental
design (^, (^, (g, ( 43 -^, ( 43 , (^. In
addition, recent work in machine learning has demonstrated
deep learning systems behaving in ways that contradict their 
appearance of solving content-recognition problems. Nguyen 
et al. [38] show how a high-performing image object recognition
system can label with high confidence nonsensical
synthetic images. In a similar direction, we have shown 0 
how a deep system that appears highly capable of recognising 
different musical rhythms confidently classifies synthesised 
rhythms, though they bear little similarity to the rhythms 
they supposedly represent. Szegedy et al. (54) show how 
deep high-performing image object recognition systems are 
highly sensitive to imperceptible perturbations created by an 
adversary: an agent that actively seeks to fool a classifier by 
perturbing the input such that it results in an incorrect output 
but with high confidence (16) . 

All of these results motivate several timely questions of 
deep learning systems for music content analysis specifically, 
and multimedia in general. First, how do the adversaries of 
Szegedy et al. [54] translate to the context of deep learning
applied to music content analysis? The input of the systems
studied by Szegedy et al. [54] is raw pixel data; however, in
music content analysis only the system studied in (T9) takes 
as input raw audio samples. The inputs to other deep learning 
systems have been features: windowed magnitude spectra 
1281, (43 , sonograms 0 , 0 , autocorrelations of spectral 
energies 0,0, or statistics of features | |63[ , (65) . Second, 
can we generate an adversary for such deep learning music
content analysis systems that produces adversarial examples
that are perceptually identical to the originals? Third, can we 
“harness” an adversary to train deep learning systems that are 
robust to its “malfeasance”? Finally, and more broadly, what 
is deep learning contributing to music content analysis? Can 
we use adversaries to reveal whether these deep systems are 
using better models of the content than other state of the art 
systems using hand-crafted features? 

Our preliminary work p3j shows that it is possible to create
highly effective adversaries of the music content analysis
deep neural networks (DNN) studied in [28], 1^. These
adversaries can make the systems always wrong, always right,
and anywhere in-between, with high confidence by applying 
only minor perturbations of the input magnitude spectra. 
Furthermore, we created an ensemble of adversaries that can 
coax the DNN into assigning with high confidence any label to
the same music by perturbing the input by very small amounts 
(e.g., 26.8 dB SNR). In this article, we greatly expand upon 
our prior work p3) to include convolutional deep learning 
systems, more extensive testing in a larger benchmark MIR 
dataset, and the results of incorporating an adversary into the 
training of these different deep learning systems. 

In the next section, we provide an overview of work 
applying deep learning to music content analysis and MIR. We 
then review two different deep learning architectures, and our 
construction of several music content analysis systems using 
two partitions of two MIR benchmark datasets. In Sec. III
we review adversaries, and design an adversary for our deep
systems. We then present in Sec. IV a series of experiments
using our adversary. In Sec. V we provide a discussion
of our work in wider contexts. We conclude in Sec. VI.
Some of our results can be produced with the software here: 
https://github.com/coreyker/dnn-mgr 


II. Deep Learning for Music Content Analysis


We first provide an overview of research in applying deep
learning approaches to music content analysis. We then discuss
two different architectures, train two music content analysis
systems, and test them in two benchmark MIR datasets. These
systems are the subjects of our experiments in Section IV.


A. Overview 

Artificial neural networks have been applied to many music 
content analysis problems, p6) , for instance, fingerprinting,
genre recognition p^ , emotion recognition | [58) , artist 
recognition and even composition gg. Advances in 

training have enabled the creation of more advanced and 
deeper architectures. Deng and Yu pT] (Chapter 7) provide 
a review of successful applications of deep learning to the 
analysis of audio, highlighting in particular its significant
contributions to speech recognition in conversational settings. 
Humphrey et al. HD provide a review for applications to 
music in particular, and motivate the capacity of deep architectures
to automatically learn hierarchical relationships in
accordance with the hierarchical nature of music: “pitch and 
loudness combine over time to form chords, melodies and 
rhythms.” They argue that this is key for moving beyond the 


reliance on “shallow” and hand-designed features that were 
designed for different tasks. 

Lee et al. [34] are perhaps the first to apply deep learning
to music content analysis, specifically genre and artist
recognition. They train a convolutional deep belief network
(CDBN) with two hidden layers in an unsupervised manner
in an attempt to make the hidden layer activations produce
meaningful features from a pre-processed spectrogram input
computed using 20 ms 50%-overlapped windows. The
spectrogram is “PCA-whitened”, which involves projecting 
it onto a lower-dimensional space using scaled eigenvectors. 
Important details are missing in the description of the work, 
but it appears they use the activations as features in some 
train/test task using a standard machine learning approach. 
A table of their experimental results, using some portion 
of the dataset ISMIR2004, shows higher accuracies for their 
deep learned features compared to those for standard MFCCs. 
For genre recognition, Li et al. p5) use convolutional deep 
neural networks (CDNN) with three hidden layers, into which 
they input a sequence of 190 13-dimensional MFCC feature 
vectors. The architecture of their CDNN is such that the first
hidden layer considers data from 127 ms duration, and the
last hidden layer is capable of summarising events over a 2.2
s duration. Van den Oord et al. apply CDNN to mel-
frequency spectrograms for automatic music content analysis. 

For genre recognition and more general descriptors, Hamel 
and Eck |28| train a DNN with three hidden layers of 50 
units each, taking as input 513 discrete Fourier transform 
(DFT) magnitudes computed from a single 46 ms audio 
frame. They use a train/valid/test partition of the benchmark 
music genre dataset GTZAN [49], [55]. They also explore
“aggregated” features, which are the mean and variance in
each dimension of activations over 5 second durations. They
find in the test set, and for both short-term and aggregated
features, that SVM classifiers trained with features built from
hidden layer activations reproduce more ground truth than an
SVM classifier trained with features built from MFCCs. They
report an accuracy of over 0.84 for features that aggregate 
activations of all three hidden layers. Sigtia and Dixon | |44[ 
explore modifications to the system in p8) , in particular using 
different combinations of architectures, training procedures, 
and regularisation. They use the activations of their trained 
DNN as features for a train/test task using a random forest
classifier. They report an accuracy of about 0.83 using
features aggregating activations of all hidden layers of 500 
units each. For genre recognition, Yang et al. | |6^ combine 
263-dimensional modulation features with a DBN. For music 
rhythm classification, Pikrakis ED employs a DBN, which we
studied further in |5D-|53). 

Dieleman et al. ID build and apply CDBN to music 
key detection, artist recognition, and genre recognition. There 
are three major differences with respect to the work above 
|28|, p5] , |44|, 1^ . First, Dieleman et al. employ 24- 

dimensional input features computed by averaging short-time 
chroma and timbre features over the time scales of single 
musical beats. Second, they employ expert musical knowledge
to guide decisions about the architecture of the system.
Finally, they use the output posteriors of their system for

Fig. 1. Illustration of the CDNN architecture we use for our experiments. The CDNN first applies narrow vertical filters to the input sonogram (left) to
capture harmonic structure. Then, it applies 32 different filters in the first convolutional layer (we show only 4). This is followed by the first max-pooling 
layer, and then a 2nd pair of convolutional and max-pooling layers. Finally, the output of the final max-pooling layer is fully connected to a final hidden layer 
of 50 units, followed by a softmax output unit. The input spectrogram contains 100 time slices, which means that the final layer of the CDNN summarises 
information over a total duration of 2.35 seconds. 


classification, instead of using the hidden layer activations as 
features for a separate classifier. Their experiments in a portion 
of the “million song dataset” | fT0| show large differences in 
classification accuracies between their systems and a naive 
Bayesian classifier using the same input features. In a unique 
direction for audio, Dieleman and Schrauwen explore 
“end-to-end” learning, where a CDNN is trained with input of 
about 3 s of raw audio samples for a music content analysis 
task (autotagging). They find that the lowest layer of the 
trained CDNN appears to learn some filters that are frequency 
selective. They evaluate this system for a multilabel problem. 

To recognise music mood, Weninger et al. [ [60| use recurrent 
DNN with input constructed of several statistics of low- 
level features computed over second-long excerpts of music 
recordings. Battenberg and Wessel ||^ apply DBN for identifying
the beat numbers over several measures of percussive
music, with input features consisting of quantised onset times 
and magnitudes. Boulanger-Lewandowski et al. GD train a 
recurrent neural network to produce chord classifications using 
input of PCA-whitened magnitude DFT. In a similar direction, 
Humphrey and Bello | |^ build a DNN that maps input 
spectrogram features to guitar-specific fingerings of chords. 

B. Two types of deep architectures 

We now review two different architectures of deep learning 
systems, and the way they are trained. A DNN is an artificial 
neural network with several hidden layers GD- The output 
of each layer is a non-linear function of its inputs, obtained 
by a matrix multiplication cascaded with a non-linearity, 
e.g., tanh, sigmoid and rectifier. By chaining together several 
hidden layers, composite representations of the input emerge 
in deeper layers. This fact can give deep networks greater 
representational power than shallower networks containing an 
equivalent number of parameters 0 - 


A CDNN is a special type of DNN with weights that are 
shared between multiple points between adjacent layers. The 
weight sharing in CDNNs not only reduces the number of 
trainable parameters, but also causes matrix multiplications to 
reduce to convolutions, which can be implemented efficiently. 
Furthermore, many natural signals have local spatial or temporal
structures that are repeated globally. For example, natural
images often consist of oriented edges; and audio signals often
consist of harmonic and repetitive structures. CDNNs can learn
these types of structures very well. Figure 1 illustrates our
CDNN, which we discuss in the following subsection.

The contemporary success of deep learning comes with 
computationally efficient training methods. Systems that have 
such deep architectures are usually trained using gradient 
descent, which consists of backpropagating error derivatives 
from the cost function through the network. There are a 
plethora of useful tips and tricks to augment training, including 
stochastic gradient descent, dropout regularisation, weight 
decay, momentum, learning rate decay, and so on GD- 

C. Deep learning with two music genre benchmarks 

We now build DNNs and CDNNs using two music genre
benchmarks: GTZAN [49], [55] and the Latin Music Database
(LMD) [45]. GTZAN consists of 100 30-second music recording
excerpts in each of ten categories, and is the most-used
public dataset in MIR research [50]. LMD is a private dataset,
consisting of 3,229 full-length music track recordings non-uniformly
distributed among ten categories, and has been used
in the annual MIREX audio latin music genre classification
evaluation campaign since 2008.* We use the first 30 seconds
of each track in LMD.

* http://www.music-ir.org/mirex/wiki/MIREX_HOME

We build several DNNs and CDNNs using different partitionings
of these datasets. One partitioning of GTZAN
we create by randomly selecting 500/250/250 excerpts for
training/validation/testing. The other partitioning of GTZAN
is “fault-filtered,” which we construct by hand to include
443/197/290 excerpts. This involves removing 70 files including
exact replicas, recording replicas, and distorted files [49],
and then dividing the excerpts such that no artist is repeated
across the training, validation, and test partitions. We partition
LMD in two ways: 1) partitioning by 60/20/20% sampling
in each class; 2) a hand-constructed artist-filtered partitioning
containing approximately the same division of excerpts in each
class. We retain all 213 replicas in 

The input to our systems is derived from the short-time
Fourier transform (STFT) of a sampled audio signal x ||Tj:

$$\mathcal{F}(x)[m,u] = \sum_{l=0}^{L-1} w[l]\, x[l - uH]\, e^{-j 2\pi m l / L} \tag{1}$$

where the parameter L defines both the window length and the
number of frequency bins. We define w as a Hann window of
length L = 1024, which corresponds to a duration of 46 ms
for recordings sampled at 22050 Hz. The window is hopped
along x with a stride of H = 512 samples (adjacent windows
overlap by 50%).
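As a minimal sketch of this analysis step, the following NumPy code computes magnitude spectral frames with the stated settings (L = 1024, H = 512, 22050 Hz). The function name `stft`, the use of the symmetric `np.hanning` window, the frame indexing `x[uH + l]`, and the choice to drop a final partial frame are assumptions made for illustration; they are not taken from the authors' software.

```python
import numpy as np

def stft(x, L=1024, H=512):
    """STFT with a Hann window of length L hopped by H samples.

    Returns a complex array of shape (L // 2 + 1, n_frames), keeping
    only the non-negative frequency bins of each real-valued frame.
    """
    w = np.hanning(L)                 # symmetric Hann window (assumed variant)
    n_frames = 1 + (len(x) - L) // H  # only frames that fit without padding
    frames = np.stack([w * x[u * H:u * H + L] for u in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

sr = 22050
t = np.arange(sr) / sr                # one second of audio
x = np.sin(2 * np.pi * 440.0 * t)     # 440 Hz test tone
S = np.abs(stft(x))                   # magnitude spectral frames, 513 bins each
```

Each column of `S` holds 513 magnitudes, which matches the frame size used as network input in the text.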

Since audio signals can be of any duration, we define the
input to our systems as a sequence $X = (X_n)_{n=0}^{N-1}$, where the
sequence length N depends on the input audio’s duration. We
define the nth element of the input sequence X to be

$$X_n = \bigl(\, |\mathcal{F}(x)[m,u]| : m \in [0, 512],\ u \in [nT, (n+1)T[ \,\bigr) \tag{2}$$

where T = 1 for each DNN and T = 100 for each CDNN.
Thus, when T = 1, X is a sequence of 513 x 1 vectors; when
T = 100, X is a sequence of 513 x 100 matrices.
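The segmentation into sequence elements can be sketched as a reshape of the magnitude spectrogram. The helper `to_sequence` is hypothetical, and dropping trailing frames that do not fill a whole block is an assumption, since the text does not say how a remainder is handled.

```python
import numpy as np

def to_sequence(S, T):
    """Split a (513, n_frames) magnitude spectrogram into N blocks of T frames.

    Trailing frames that do not fill a whole block are dropped
    (an assumption; the remainder handling is not specified).
    """
    n_bins, n_frames = S.shape
    N = n_frames // T
    return S[:, :N * T].reshape(n_bins, N, T).transpose(1, 0, 2)

S = np.random.default_rng(0).random((513, 1290))  # stand-in for about 30 s of frames
X_dnn = to_sequence(S, T=1)      # one 513 x 1 vector per element
X_cdnn = to_sequence(S, T=100)   # one 513 x 100 matrix per element
```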

A (C)DNN processes each element in this sequence independently,
outputting a sequence $P = (P_n)_{n=0}^{N-1}$ from the final
(softmax) layer. The output vector $P_n \in [0,1]^K$, $\|P_n\|_1 = 1$,
is the posterior distribution of labels assigned to the nth
element in the input sequence by the network. Therefore, we
may write $P_n(k|X_n, \Theta) = P_n(k) \in [0,1]$, where $\Theta$ represents
the trainable network parameters, i.e., the set of weights and
biases. We define the confidence of a (C)DNN in a particular
label $k \in \{1, \ldots, K\}$ for an input sequence X as the sum of
all posteriors, normalised by the sequence length, i.e.,

$$R(k|X, \Theta) = \frac{1}{N} \sum_{n=0}^{N-1} P_n(k|X_n, \Theta). \tag{3}$$

We apply a label to an input sequence X as the one maximising
the confidence

$$\hat{y}(X, \Theta) = \arg\max_{k \in \{1, \ldots, K\}} R(k|X, \Theta). \tag{4}$$
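The confidence and labelling rule can be illustrated with a few lines of NumPy. This sketch assumes the confidence is the posterior averaged over the sequence; the normalisation does not change which label attains the maximum. The function names and the toy posteriors are illustrative only, and labels are 0-based here.

```python
import numpy as np

def confidence(P):
    """Mean posterior per label over the N elements of a sequence; P has shape (N, K)."""
    return P.mean(axis=0)

def label(P):
    """Label with maximal confidence, as a 0-based index."""
    return int(np.argmax(confidence(P)))

# Toy posteriors for N = 3 sequence elements and K = 3 labels.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.7, 0.2, 0.1]])
R = confidence(P)
```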

Paralleling the work in [44], we build DNNs with 3 fully
connected hidden layers, and either 50 or 500 units per layer.
Our CDNN has two convolutional layers (accompanied by
max pooling layers) followed by a fully connected hidden
layer with 50 units. Figure 1 illustrates the architecture of our
CDNN. Its first convolutional layer contains 32 filters, each
arranged in a rectangular 400 x 4 grid. We choose this long
rectangular shape instead of the small square patches typically
used when training on images based on our knowledge that
many sounds exhibit strong harmonic structures that span a
large portion of the audible spectrum. The second convolutional
layer contains 32 filters, each connected in an 8 x 8
pattern. Our two pooling neighborhoods are 4 x 4 and have
strides of 2 x 2. All of our deep learning systems use rectified
linear units (ReLUs), and have a softmax unit in the final layer.
As is typical, we standardise the (C)DNN inputs by subtracting
the training set mean and dividing by the standard deviation
in each of the input dimensions. We perform this with a linear
layer above the input layer of each network. The raw inputs
to the network are still $X_n$.
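The standardisation step can be written as a fixed affine layer, which is one way to read "a linear layer above the input layer". The sketch below uses random stand-in data rather than music features, and the small constant added to the standard deviation is an assumption to avoid division by zero.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((2000, 513)) * 5.0 + 1.0   # stand-in for training-set magnitude frames

mean = train.mean(axis=0)
std = train.std(axis=0) + 1e-8                # small constant guards against zero variance

# Standardisation expressed as a fixed affine layer z = W * x + b,
# with the diagonal weight matrix stored as a vector of per-dimension scales.
W = 1.0 / std
b = -mean / std

def standardise(x):
    """Apply the fixed standardising layer to raw frames of shape (..., 513)."""
    return W * x + b

z = standardise(train)
```

Because the layer is part of the network, an adversary that backpropagates to the input still perturbs the raw magnitudes rather than the standardised values.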

Also paralleling [44], we build several music classification
systems treating our DNN as a feature extractor. In this case,
we construct a set of features by concatenating the activations
from the DNN’s three hidden layers, and aggregating them
over 5-second texture windows (hopped by 50%). The aggregation
summarises the mean and standard deviation of the
feature dimensions over the texture window and may be seen
as a form of late-integration of temporal information. We use
this new set of features to train a random forest (RF) classifier
[29] with 500 trees. Thus, to classify a music audio recording
x from its set of aggregated features, we use majority voting
over all classifications, which is also used in in.
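A rough sketch of the aggregation and voting steps follows. The window of 215 frames is an assumption derived from 5 s x 22050 Hz / 512 samples per hop, the hop of 107 frames approximates 50%, and both function names are hypothetical. The random forest itself is omitted; only the feature aggregation and the final vote are shown.

```python
import numpy as np

def aggregate(acts, win=215, hop=107):
    """Mean and standard deviation of activations over hopped texture windows.

    acts: (n_frames, n_features) hidden-layer activations, all layers concatenated.
    Returns an array of shape (n_windows, 2 * n_features).
    """
    starts = range(0, acts.shape[0] - win + 1, hop)
    return np.stack([
        np.concatenate([acts[s:s + win].mean(axis=0), acts[s:s + win].std(axis=0)])
        for s in starts
    ])

def majority_vote(window_labels):
    """Most frequent label among per-window classifications (ties go to the smallest label)."""
    values, counts = np.unique(window_labels, return_counts=True)
    return int(values[np.argmax(counts)])

acts = np.random.default_rng(0).random((1290, 150))  # e.g. 3 hidden layers x 50 units
feats = aggregate(acts)
```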


D. Preliminary evaluation 

Figure 2 and Table I show the results of RF classification
using the features produced by the DNN when trained on
GTZAN with the two different partitioning strategies; and Fig. 3
shows those for the (C)DNNs we train and test in LMD.
Across each partition strategy we see significant differences
in performance. The mean recall in each class in Figure 2
on the fault-filtered partition is much lower than that on
the random test partition, with drops of more than 30
percentage points in most cases. Table I shows similar drops
in performance that persist over the inclusion of drop-out
regularisation. Such significant drops in performance from
partitioning based on artists are not unusual, and have been
studied before as a bias coming from the experimental design
12^ , [39], [49]. Partitioning a music genre recognition dataset
along artist lines has been recommended to avoid this bias
| |2^ , 1^, and is in fact used in several MIREX audio
classification tasks.* Fault-filtered partitioning has not been
used in many benchmark experiments with GTZAN because its
artist information has only recently been made available [49].

III. Adversaries in music content analysis 

An adversary is an agent that tries to defeat a classification
system in order to maximise its gain, e.g., SPAM detection.
Dalvi et al. GD pose this problem as a game between a
classifier and adversary, and analyse the strategies involved
for an adversary with complete knowledge of the classification


^ https://highnoongmt.wordpress.com/2014/02/08/faults_in_the_latin_music_database

Fig. 2. Figure of merit (FoM, ×100) in GTZAN with two different partitionings for random forest classification (majority vote) of DNN-based features (all
layers) aggregated over 5 second windows (mean and standard deviations). Each DNN has 500 rectified linear units in each hidden layer. Columns represent 
the true class; rows denote labels chosen by system; the diagonal contains the per-class recall; the off-diagonal entries are confusions; the rightmost column 
is the precision; the bottom row is the F-score; and the last element along the diagonal is the mean recall (normalised classification accuracy). 


Hidden Units   Layer   ReLU            ReLU+Dropout
50             1       76.00 (40.69)   80.40 (45.17)
50             2       78.80 (45.17)   80.40 (43.10)
50             3       79.60 (43.79)   78.80 (44.48)
50             All     80.40 (43.79)   80.00 (43.79)
500            1       68.40 (40.34)   75.60 (40.69)
500            2       74.40 (40.69)   80.00 (50.34)
500            3       77.60 (43.79)   79.20 (48.62)
500            All     76.00 (42.41)   81.20 (48.97)

TABLE I
Mean normalised classification accuracy (×100) in GTZAN for random forest classification of DNN-based features from layer shown aggregated over 5-second windows. Number outside brackets is from random partition in Fig. 2(a); and that inside brackets is from fault-filtered partition in Fig. 2(b).


system, and for a classifier to adapt to such an adversary.
Szegedy et al. [54] propose using adversaries for testing the
assumption that deep learning systems are “smooth classifiers,”
i.e., stable in their classification to small perturbations around
examples in the training data. They define an adversary of
a classifier $f : \mathbb{R}^m \to \{1, \ldots, K\}$ as an algorithm using
complete knowledge of the classifier to perturb an observation
$x \in \mathbb{R}^m$ such that $f(x + r) \neq f(x)$, where $r \in \mathbb{R}^m$ is some
small perturbation. Specifically, their adversary solves the
constrained optimisation problem for a given $k \in \{1, \ldots, K\}$:

$$\min \|r\|_2 \quad \text{subject to} \quad f(x + r) = k. \tag{5}$$

For $k \neq f(x)$, Szegedy et al. [54] employ a line search
along the direction of the loss function of the network starting
from x until the classifier produces the requested class. They
find that adversarial examples of one classifier can fool other
classifiers trained on independent data; hence, one need not
have complete knowledge of a classifier in order to fool it.

Goodfellow et al. |[2^ provide an intuitive explanation 


of these adversaries: even though the perturbations in each 
dimension might be small, their contribution to the magnitude 
of a projection grows linearly with input dimensionality. With 
a deep neural network involving many such projections in each 
layer, a small perturbation at its high-dimensional input layer 
can create major consequences at the output layer. Goodfellow
et al. show that adversarial examples can be easily
generated by making the perturbation proportional to the sign
of the partial derivative of the loss function used to train a 
particular network, evaluated with the requested class. They 
also find that the direction of perturbation is important, not 
necessarily its size. Hence, it seems adversarial examples of 
one model will likely fool other models because they occur in 
large volumes in high-dimensional spaces. This is also found 
by Gu and Rigazio | |27) . 
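The sign-of-gradient idea can be illustrated on a toy linear softmax classifier, which is an assumption chosen to keep the gradient explicit; it is not one of the networks studied here. The step is taken against the sign of the gradient so that the loss of the requested class decreases, and every input dimension moves by exactly `eps`. All names and values are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss(W, b, x, y):
    """Cross-entropy -log softmax(Wx + b)[y], computed stably."""
    z = W @ x + b
    return float(np.log(np.sum(np.exp(z - z.max()))) + z.max() - z[y])

def grad_loss_wrt_input(W, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 1000)), np.zeros(3)   # toy linear classifier, 3 classes
x = rng.normal(size=1000)
target = int(np.argmin(W @ x + b))               # request the least likely class
eps = 0.1                                        # per-dimension perturbation size

# Step against the sign of the gradient to lower the loss of the requested class.
x_adv = x - eps * np.sign(grad_loss_wrt_input(W, b, x, target))
```

With 1000 input dimensions, a change of 0.1 per dimension accumulates across the projection, which is the linear growth with dimensionality described above.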

As for Szegedy et al. [54], we are interested in the robustness
of our deep learning music content analysis systems to an
adversary. Do these systems suffer just as dramatically as
the image content recognition systems in 0, 1^, g? In
other words, can we find imperceptible perturbations of audio
recordings, yet make the systems produce any label with high
confidence? If so, can we adapt the training of the systems
such that they become more robust? In the next subsections,
we define an adversary as an optimisation problem, but with
care of the fact that the input to our deep learning systems is
the magnitude STFT (1). We then present an approach to integrate
adversaries into the training of our systems. We present our
experimental results in Section IV.


A. Adversaries for music audio 

Fig. 3. FoM for deep learning systems with two different partitioning strategies of LMD: (a) DNN, random partitioning; (b) DNN, artist-filtered partitioning; (c) CDNN, random partitioning; (d) CDNN, artist-filtered partitioning. Interpretation as in Fig. 2, but note that in this case we are using
the deep learning systems as the classifiers, instead of performing classification using a random forest with features derived from hidden layer activations.

The explicit goal of our adversary is to perturb a music
recording x such that a system will confidently classify it
with some class $y \in \{1, \ldots, K\}$. Specifically, we define the
adversary as the constrained optimisation problem:


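$$\hat{X}(y) = \arg\min_{Z \in \mathcal{C}(X)} \sum_{n=0}^{N-1} \mathcal{L}(Z_n, y|\Theta) \tag{6}$$

where we define the feasible set of adversarial examples to
input sequence X as:

$$\mathcal{C}(X) = \Bigl\{ Z = (Z_n)_{n=0}^{N-1} : \sqrt{\textstyle\sum_{n=0}^{N-1} \|Z_n - X_n\|_2^2} \le N \epsilon(\mathrm{SNR}) \Bigr\} \tag{7}$$

with the parameter

$$\epsilon(\mathrm{SNR}) = \frac{1}{N} \frac{\sqrt{\sum_{n=0}^{N-1} \|X_n\|_2^2}}{10^{\mathrm{SNR}/20}} \tag{8}$$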


limiting the maximum acceptable perturbation caused by the
adversary. The loss function in (6) is the cross-entropy loss
function, $\mathcal{L}(X_n, y|\Theta) := -\log P_n(y|X_n, \Theta)$, which we use
in training our (C)DNNs. Given the network parameters $\Theta$,
this adversary can compute the derivative of this loss function
by backpropagating derivatives through the network. This suggests
that our adversary can accomplish its goal by searching
for a new input sequence $\hat{X}$ via gradient descent on the loss
function with any label y that differs from the ground truth.
This is the approach used by Szegedy et al. [54] in the context
of image object recognition.

Algorithm 1 From exemplar sequence $X$ search for adversarial sequence $\hat{X}$ with maximal perturbation SNR in at most $k_{\max}$ steps that makes a (C)DNN with parameters $\Theta$ apply label $y$ with confidence $R_{\min}$
1: parameters: $y$, SNR, $\mu$, $R_{\min}$, $\Theta$, $k_{\max}$
2: init: $X^{(0)} = X$, $k = 0$
3: repeat
4:   $Y \leftarrow X^{(k)} + \mu \nabla \mathcal{L}(X^{(k)}, y|\Theta)$   {Gradient step}
5:   $W \leftarrow \mathcal{P}_{GL}(\max(0, Y))$   {Find valid sequence}
6:   $\nu \leftarrow \max\bigl(0, \tfrac{1}{N} \sqrt{\sum_{n} \|W_n - X_n\|_2^2} / \epsilon(\mathrm{SNR}) - 1\bigr)$   {Lagrange mult.}
7:   $X^{(k+1)} \leftarrow (1 + \nu)^{-1} (W + \nu X)$   {SNR constraint}
8:   $k \leftarrow k + 1$
9: until $\tfrac{1}{N} \sum_{n} P_n(y|X_n^{(k)}, \Theta) > R_{\min}$ or $k = k_{\max}$
10: return: $\hat{X} = X^{(k)}$
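The SNR constraint can be sketched in NumPy as follows. The sketch assumes that the budget ε(SNR) is the exemplar's root energy divided by N and by 10^(SNR/20), and that the projection shrinks the perturbation towards the exemplar with a multiplier ν, as in steps 6 and 7 of Alg. 1. Function names and the random stand-in data are illustrative.

```python
import numpy as np

def epsilon(X, snr_db):
    """Perturbation budget per sequence element for a minimum SNR in dB; X has shape (N, ...)."""
    N = X.shape[0]
    return np.sqrt(np.sum(X ** 2)) / (N * 10.0 ** (snr_db / 20.0))

def project_snr(W, X, snr_db):
    """Pull W towards the exemplar X until the perturbation W - X meets the SNR bound."""
    N = X.shape[0]
    err = np.sqrt(np.sum((W - X) ** 2))
    nu = max(0.0, err / (N * epsilon(X, snr_db)) - 1.0)   # zero when already feasible
    return (W + nu * X) / (1.0 + nu)

def measured_snr(X, Z):
    """SNR in dB of the perturbed sequence Z relative to the exemplar X."""
    return 20.0 * np.log10(np.sqrt(np.sum(X ** 2)) / np.sqrt(np.sum((Z - X) ** 2)))

rng = np.random.default_rng(0)
X = rng.random((12, 513, 100))             # stand-in for a CDNN input sequence
W = X + 0.5 * rng.normal(size=X.shape)     # a perturbation far larger than allowed
Z = project_snr(W, X, snr_db=15.0)         # lands exactly on the 15 dB boundary
```

A candidate that already satisfies the bound gives ν = 0 and is returned unchanged.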

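A local minimum of (6) can be found using projected
gradient descent, initialised with the exemplar $X^{(0)} \leftarrow X$,
and iterating

$$X^{(k+1)} \leftarrow \mathcal{P}_{\mathcal{C}}\bigl(X^{(k)} + \mu \nabla \mathcal{L}(X^{(k)}, y|\Theta)\bigr) \tag{9}$$

where the scalar $\mu$ is the gradient descent step size, and
$\mathcal{P}_{\mathcal{C}}(\cdot)$ computes the least squares projection of its argument
onto the set $\mathcal{C}(X)$ defined in (7). Note that we define operations
on sequences element-wise, e.g., $\nabla \mathcal{L}(X^{(k)}, y|\Theta) = \bigl(\nabla \mathcal{L}(X_n^{(k)}, y|\Theta)\bigr)_{n=0}^{N-1}$.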

The main difficulty with this approach is that not all
sequences X can be mapped back to valid time-domain
signals x. This is because the analysis in (1) uses overlapping
windows, which causes adjacent elements in the sequence X to
become dependent. This means that individual elements from
the sequence X cannot be adjusted arbitrarily if we want X
to have an analog in the time-domain. Therefore, in order to
generate valid adversarial examples, we include an additional
processing step that projects the sequence X onto the space
of time-frequency coefficients arising from valid time-domain
sequences. This is done using the Griffin and Lim algorithm
[25], which seeks to minimise

$$\mathcal{P}_{GL}(X) = \arg\min_{Z \in \mathcal{X}} \sum_{n=0}^{N-1} \|Z_n - X_n\|_2^2 \tag{10}$$

where $\mathcal{X} = \{X = (X_n)_{n=0}^{N-1} : \mathcal{P}_{GL}(X) = X\}$ denotes the set
of all valid sequences. This minimisation can be performed
using alternating projections, and we have found that in
practice it is sufficient to apply a single set of projections. We
do this by first rebuilding a complex valued time-frequency
representation from the sequence X

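$$U[m,u] = \begin{cases} X_{\lfloor u/T \rfloor}[m,\ u \bmod T]\, e^{j\Phi[m,u]}, & 0 \le m < D \\ X_{\lfloor u/T \rfloor}[L - m,\ u \bmod T]\, e^{j\Phi[m,u]}, & D \le m < L \end{cases} \tag{11}$$

where $D = L/2 + 1$ and $\Phi[m,u] = \angle \mathcal{F}(x)[m,u]$ is the phase
from the exemplar’s Fourier transform. The inverse Fourier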


Algorithm 2 Train (C)DNN using database of labeled sequences $(X, Y)$ and fast adversarial generation p^ , with $\epsilon$ and $\mu$ the gradient descent step sizes for adjusting the adversarial inputs and network weights, respectively.
1: parameters: $\epsilon$, $\mu$
2: init: (C)DNN parameters $\Theta$ to small random weights
3: repeat
4:   select $\hat{Y}$ uniformly from $\{1, \ldots, K\}$
5:   $\hat{X} \leftarrow X + \epsilon \nabla \mathcal{L}(X, \hat{Y}|\Theta)$   {Generate adversarial ex.}
6:   $\Theta \leftarrow \Theta + \mu \nabla \mathcal{L}(\hat{X}, Y|\Theta)$   {Model update}
7: until stopping condition


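transform $\mathcal{F}^{-1}(U)$ is a time-domain signal, and so the Fourier
transform of this signal, $\mathcal{F} \circ \mathcal{F}^{-1}(U)$, will yield a valid DFT
spectrum that can be used to build a valid input sequence for
our (C)DNN, i.e., by replacing $\mathcal{F}(x)$ by $\mathcal{F} \circ \mathcal{F}^{-1}(U)$ in the
definition of the input sequence.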
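A single set of projections of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the helper names `stft`, `istft`, and `project_to_valid` are hypothetical, the window and hop size are free parameters, and the one-sided `np.fft.rfft` / `np.fft.irfft` pair is used so that the conjugate-symmetric half of the spectrum is handled implicitly instead of being built explicitly.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed one-sided DFT frames of the real signal x."""
    L = len(win)
    n_frames = 1 + (len(x) - L) // hop
    frames = np.stack([x[i * hop:i * hop + L] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, win, hop):
    """Least squares overlap-add inversion of a (possibly invalid) STFT."""
    n_frames, L = S.shape[0], len(win)
    frames = np.fft.irfft(S, n=L, axis=1) * win
    x = np.zeros(hop * (n_frames - 1) + L)
    norm = np.zeros_like(x)
    for i in range(n_frames):
        x[i * hop:i * hop + L] += frames[i]
        norm[i * hop:i * hop + L] += win ** 2
    # Guard against division by zero where the window vanishes.
    return x / np.maximum(norm, 1e-12)

def project_to_valid(mag, phase, win, hop):
    """Combine magnitudes with the exemplar phase, go to the time domain,
    and re-analyse to obtain magnitudes of a valid time-domain signal."""
    U = mag * np.exp(1j * phase)
    x_hat = istft(U, win, hop)
    return np.abs(stft(x_hat, win, hop)), x_hat
```

A magnitude sequence that already comes from a real signal is left unchanged by `project_to_valid`, which is the fixed-point property that defines the set of valid sequences.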

The pseudo-code in Alg. 1 summarises this approach. The
algorithm may be terminated when the mean posterior of the
target adversarial label exceeds the threshold $R_{\min}$, or after a
maximum number of epochs $k_{\max}$ (in which case an adversary
cannot be found above the minimum SNR).

B. Training with adversaries for music audio 

As per 0 and p4) , we can attempt to use our adversary as 
a regulariser and to create systems robust against adversarial
inputs. In particular, we create adversaries for the (C)DNN
discussed above, and use them to generate a (possibly) infinite
supply of new samples during training. The iterative procedure
for generating adversaries in Alg. 1 is too slow to be practical
for training, which requires on the order of 50 to 200 training
epochs. Therefore, we apply the single gradient step procedure
suggested in [24]. In our experience, this procedure often
generates inputs that confuse the network, although not typically
with a high confidence. The pseudo-code in Alg. 2 illustrates
our training algorithm, where $(X, Y)$ represent the training
data, i.e., the set of input audio sequences and their labels,
and $\hat{Y}$ is a set of adversarial labels.
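The structure of this training loop can be illustrated with a toy model. The sketch below replaces the (C)DNN with a linear softmax classifier so that both gradients have closed forms, and it assumes that the objective is the log-likelihood of a label, so that the updates are ascent steps as in Alg. 2. The function names, the full-batch updates, and the default values of `eps`, `mu`, and `n_iters` are illustrative assumptions only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_with_adversary(X, Y, n_classes, eps=0.1, mu=0.1, n_iters=200, seed=0):
    """Toy version of Alg. 2 with a linear softmax model (an assumption)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)
    for _ in range(n_iters):
        # Step 4: draw adversarial target labels uniformly.
        Y_adv = rng.integers(0, n_classes, size=len(Y))
        # Step 5: single gradient step on the input towards the adversarial
        # label; d log p(y|x) / dx = W^T (e_y - p) for this model.
        P = softmax(X @ W.T + b)
        X_adv = X + eps * (onehot[Y_adv] - P) @ W
        # Step 6: update the model on the adversarial inputs, true labels.
        R = onehot[Y] - softmax(X_adv @ W.T + b)
        W += mu * R.T @ X_adv / len(Y)
        b += mu * R.mean(axis=0)
    return W, b
```

Each pass generates fresh adversarial inputs from the current parameters, which is what makes the supply of training samples effectively unlimited.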

IV. Experimental Results 

We can design an adversary (Alg. 1) such that it will attempt
to make a system behave in different ways. For instance,
an adversary could attempt to perturb an input within some
limit (SNR) such that the (C)DNN makes a high-confidence
classification ($R_{\min} \approx 1$) that is correct with probability $p$.
Another adversary could attempt to make the system label any
input using the same label. We can also make an ensemble of
adversaries such that they produce adversarial examples that
a (C)DNN classifies in every possible way.

We define our adversaries (Alg. 1) using $R_{\min} = 0.9$, SNR
= 15 dB, $\mu = 0.1$, and $k_{\max} = 100$, and with the directive to
make the (C)DNN correct with probability $p = 0.1$. More
concretely, for each test observation, the adversary draws
uniformly one of the dataset labels $y$, then seeks to find in
no more than $k_{\max} = 100$ iterations using step size $\mu = 0.1$
a valid perturbation with an SNR no lower than 15 dB, and which the
(C)DNN labels as $y$ with confidence $R_{\min} = 0.9$. Figure 4a)
(a) GTZAN fault-filtered: 23.0 ± 4.5dB


(b) LMD artist-filtered: 15.78 ± 4.65dB 


Fig. 4. For the DNN-based classifier in Fig.|^b) and the CDNN in Fig.|^d), but with all input intercepted by an adversary intent on making the maximum 
posterior correct with probability $p = 0.1$. For this adversary, $R_{\min} = 0.9$, SNR = 15 dB, $\mu = 0.1$, and $k_{\max} = 100$. Sub-captions show the resulting SNR
(mean ± standard deviation) for the adversarial test sets of GTZAN (N = 290) and LMD (N = 646).


Classification in GTZAN

Music excerpt | Blues | Classical | Country | Disco | Hiphop | Jazz | Metal | Pop | Reggae | Rock
Little Richard, “Last Year’s Race Horse” | 32 (23) | 29 (23) | 36 (25) | 36 (26) | 36 (25) | 33 (24) | 32 (24) | 31 (25) | 42 (26) | 36 (25)
Rossini, “William Tell Overture” | 32 (25) | 37 (30) | 40 (29) | 43 (28) | 34 (24) | 36 (29) | 33 (25) | 34 (26) | 37 (26) | 37 (28)
Willie Nelson, “A Horse Called Music” | 25 (-) | 25 (20) | 30 (27) | 30 (20) | 26 (19) | 30 (25) | 27 (23) | 21 (20) | 30 (23) | 29 (23)
Simian Mobile Disco, “10000 Horses Can’t Be Wrong” | 31 (30) | 36 (31) | 38 (32) | 45 (34) | 41 (33) | 40 (32) | 33 (31) | 47 (34) | 42 (33) | 38 (33)
Rubber Bandits, “Horse Outside” | 27 (27) | 27 (27) | 36 (29) | 42 (31) | 38 (29) | 34 (28) | 32 (28) | 37 (29) | 36 (29) | 35 (29)
Leonard Gaskin, “Riders in the Sky” | 32 (23) | 30 (25) | 32 (23) | 35 (25) | 31 (22) | 35 (29) | 34 (23) | 26 (23) | 35 (25) | 35 (24)
Jethro Tull, “Heavy Horses” | 29 (26) | 28 (26) | 40 (29) | 42 (29) | 38 (28) | 36 (28) | 34 (28) | 34 (28) | 37 (28) | 36 (29)
Echo and The Bunnymen, “Bring on the Dancing Horses” | 29 (25) | 28 (26) | 38 (28) | 43 (28) | 35 (26) | 34 (26) | 33 (26) | 33 (26) | 36 (27) | 38 (28)
Count Prince Miller, “Mule Train” | 32 (30) | 29 (30) | 41 (33) | 37 (34) | 43 (33) | 36 (31) | 33 (31) | 42 (34) | 40 (33) | 33 (33)
Rolling Stones, “Wild Horses” | 30 (22) | 32 (24) | 37 (25) | 40 (25) | 31 (22) | 34 (25) | 31 (26) | 32 (23) | 37 (25) | 37 (26)


TABLE II 

SNR OF PERTURBATIONS PRODUCED BY TWO ENSEMBLES OF ADVERSARIES THAT INTERCEPT THE INPUT TO THE SYSTEM IN FlG.j^B) AND HAVE IT 
PRODUCE ALL CLASSIFICATIONS POSSIBLE WITH CONFIDENCE THRESHOLDS Rmin = 0.5 (Rmin = 0.9 IN BRACKETS). THE AVERAGE SNR IS 34.5 
(26.8) dB. This table can be heard at http://www.eecs.qmul.ac.uk/~sturm/research/DNN_adversaries 


shows the FoM of the DNN-based classification system in Fig. 
I^b), but with input intercepted by this adversary. Note that in 
this case the classification is performed by the same random 
forest classifier using the aggregated hidden layer activations, 
but the adversary is unaware of this. In other words, it is only 
trying to force the DNN to misclassify inputs that have been 
subject to minor perturbations. Compared with a normalised 
accuracy of 0.49 in Fig. Hb), we see our adversary has 
successfully confused the random forest classifier to be no 
better than random. Figure 5 shows one of the adversarial
examples from this experiment. Apart from some significant 
high-frequency deviations, the spectrum of the adversary is 
very similar to that of the original. The SNR in this example 
is 21.1dB. 
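For reference, an SNR figure of this kind can be computed by treating the difference between the adversarial example and the original as noise. The sketch below assumes the usual energy-ratio definition in decibels; the function name `snr_db` is hypothetical.

```python
import numpy as np

def snr_db(x, x_adv):
    """SNR in dB of the original x relative to the perturbation x_adv - x.

    Assumes the standard energy-ratio definition; undefined if x_adv == x.
    """
    noise = x_adv - x
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))
```

Under this definition a perturbation whose amplitude is one tenth that of the signal corresponds to 20 dB, so larger values mean smaller perturbations.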

Figure 4b) shows the FoM of the CDNN classification
system in Fig. 3d) attacked by the same adversary. In this
case, the CDNN proved more difficult to fool, but still the
adversary is able to significantly reduce the normalised
classification accuracy from 0.63 to 0.41 with high confidence
classifications at rather high SNR. If we reduce the minimum
confidence to $R_{\min} = 0.5$ and lessen the SNR constraint to
−300 dB, then the adversary makes the CDNN perform even
worse: a normalised accuracy of 0.28 with a mean SNR of
11.15 ± 8.32 dB.

For the same system in Fig. [^b), and using iimin = 0.9, 
SNR = 15dB, ^ = 0.1 and = 100, we show in that 
we are able to create adversaries that make the system always
right, always wrong, and always select “Jazz.” Table II shows
the results of two ensembles of adversaries, each intent on 
making the system in Fig. |^b) choose one of every label in 
GTZAN for the same music with SNR = 15dB, p = 0.1 and 
$k_{\max} = 100$. The adversaries of one ensemble insist upon a
classification confidence of at least $R_{\min} = 0.5$, and those of the
other of at least 0.9. These music recordings are the same
30-second excerpts used in [48]. We see that in all cases but one,
the ensembles are able to elicit high confidence classifications
from the system with minor perturbations of the input. We also
see that larger perturbations are produced on average when the 
Fig. 5. Top left: spectrogram excerpt from GTZAN Classical “21” (Mozart, Symphony No. 39 Finale) that the DNN-based system in Fig. Sb) classifies
as Classical. Top middle: spectrogram of adversarial example classified as Reggae. Top right: spectrogram of the difference of the two. Bottom: magnitude 
spectrum of one frame (1024 samples) of the original (light blue), adversarial example (black), and difference (orange). Note that all excerpts in GTZAN have 
a sampling rate of 22050 Hz. The SNR = 21.1dB. 


Deep Learning System | Norm. Acc. | Norm. Acc. w/ Adversary | SNR (dB), mean ± std. dev.
DNN-LMD Fig. 3b) | 0.63 | 0.03 | 37.8 ± 4.6
DNN-LMD+ADV | 0.55 | 0.06 | 36.5 ± 5.4
CDNN-LMD Fig. 3d) | 0.63 | 0.21 | 9.62 ± 5.8
CDNN-LMD+ADV | 0.56 | 0.21 | 9.74 ± 6.4


TABLE III

Results of applying adversary to make systems in Fig. 3b,d)
always incorrect, and after training with adversary (Alg. 2).

adversaries insist on a higher minimum confidence: 34.5 dB
for a confidence of at least $R_{\min} = 0.5$, and 26.8 dB for a
confidence of at least $R_{\min} = 0.9$.

These results can be heard here:
http://www.eecs.qmul.ac.uk/~sturm/research/DNN_adversaries
We find that the perturbations caused by these adversaries
are certainly perceptible,
unlike those found for image data in [54] and [24]; however,
the distortion is very minor, and the music remains exactly the
same, e.g., pitches, rhythm, lyrics, instrumentation, dynamics,
and style all remain the same.

We now perform an experiment to compare (C)DNNs
trained with adversarial examples (as per Alg. 2) to the systems
in Fig. 3b,d). To do this we test the response of these systems
against an adversary aimed at always eliciting an incorrect
response. (This is different from the adversary used above,
which seeks to make the system correct with probability
$p = 0.1$.) For this experiment, we set $R_{\min} = 0.5$ and SNR
to −300 dB in order to allow arbitrarily large perturbations to
force misclassifications. Table III illustrates the results of this
experiment from which we observe several interesting results.
Column 1 shows the normalized accuracy on the original test 
set (with no adversary present). We see that training against 
adversarial examples leads to a slight deflation in accuracy on 
new test data. Column 2 shows the normalized accuracy of 
these systems against our adversary intent on forcing a 100% 
error rate. We see that the CDNN systems are more robust to 


this adversary, and that the systems trained against adversarial 
examples confer little to no advantage. Column 3 shows the 
average perturbation size of the adversarial examples that 
led to misclassifications. We notice that larger perturbations 
(corresponding to lower SNRs) were required to get the 
CDNN systems to misclassify test inputs. The minimum SNR 
produced was 0.11 dB, while the maximum was 47.6 dB. 
The results of this experiment point to the conclusions that 

a) the CDNN systems are more robust to this adversary; and 

b) training against adversarial examples (contrary to what we 
hypothesized) does not seem to reduce the misclassification rate
against new adversarial examples. A possible explanation for 
the latter results is that, due to the high-dimensional nature 
of the input space, the set of possible adversarial examples is 
densely packed, so that training on a small number of these 
points is not sufficient to allow the systems to generalize to 
new adversarial examples. 

V. Discussion 

Returning to the broadest question motivating our work, we 
seek to measure the contribution of deep learning to music 
content analysis. The previous sections describe a series of 
experiments we have conducted using deep learning systems 
of a variety of architectures, which we have trained and tested 
in two different partitions of two benchmark music datasets. We
have evaluated the robustness of these systems to an adversary 
that has complete knowledge of the classifiers, and have also 
investigated the use of an adversary in the training of deep 
learning systems. 

Our experimental results in Fig. and Table [I] are essentially 
reproductions of those reported in [44]. Based on the results
of their experiments with random partitionings of GTZAN,
Sigtia et al. [44] claim that their DNN-based systems learn
features that “better represent the audio” than standard or 
(a) GTZAN fault-filtered (b) LMD artist-filtered

Fig. 6. As in Fig. FoM for majority vote of minimum Mahalanobis distance classification of mean and variances over 5-second “texture” windows of 
zero-crossings and the first 13 MFCCs computed from 46 ms windows hopped 50%. 


“hand-crafted” features, e.g., those referenced in [30] like
MFCCs. Similar conclusions are made about the deep learning 
systems in also based on experiments using a random 
partitioning of GTZAN. However, we see in Fig. and Table 
H] that when we consider the faults in the GTZAN dataset and 
partition it along artist lines, as for the LMD dataset in Fig. 
our deep learning systems perform significantly worse. This is 
an expected outcome ||2^, p9) , | |49t , but the artist information 
in GTZAN was not available until 2012 [43].

This motivates the question of whether DNN-based systems 
really do perform better than that of a classifier using standard, 
low-level and “hand-crafted” features. To examine this, we 
build baseline systems that use low-level features, and train 
and test them in the same fault-filtered partition of GTZAN 
as in Fig. [^b), and the artist-filtered partition of LMD as in 
Fig. [3b,d). Mimicking Q, we compute these features 
based on a short-time analysis using 46ms frames hopped 
by 50%. From each frame we extract the first 13 Mel- 
frequency cepstral coefficients (MFCCs) and zero-crossings, 
and compute their mean and variance over five-second texture 
windows (which are also hopped by 50%). We combine the 
features of the training and validation sets of the fault-filtered 
partition of GTZAN, and the artist filtered partition of LMD. 
Both systems use a minimum Mahalanobis distance classifier, 
and assign a class by majority vote from the classifications 
of the individual texture windows. Figure 6 shows the FoM
produced by these baseline systems. We see that for GTZAN 
it actually reproduces more ground truth than the DNN in Fig.
I^b) and all but one in Table |I] Our simple baseline system 
for LMD reproduces much less ground truth than the (C)DNN 
in Fig. [^b,d). Nonetheless, we have no reason to accept 
the conclusion that deep learning features “perform better” 
than “hand-crafted” features for the particular architectures 
considered here and those in (g. Different experiments 
are needed to address such a conclusion. 
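The decision rule of such a baseline can be sketched compactly. The code below is an illustration only: it assumes that the texture-window features (means and variances of the MFCCs and zero-crossings) have already been computed, and it uses a single covariance matrix pooled over classes, which is one common choice and an assumption here. The function names are hypothetical.

```python
import numpy as np
from collections import Counter

def fit_min_mahalanobis(features, labels):
    """Estimate class means and a pooled inverse covariance (an assumption)."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / (len(features) - len(classes))
    return means, np.linalg.pinv(cov)

def classify_excerpt(windows, means, inv_cov):
    """Label each texture window by minimum Mahalanobis distance,
    then assign the excerpt label by majority vote over the windows."""
    votes = []
    for w in windows:
        dists = {c: (w - m) @ inv_cov @ (w - m) for c, m in means.items()}
        votes.append(min(dists, key=dists.get))
    return Counter(votes).most_common(1)[0][0]
```

Here `windows` holds one feature vector per texture window of a single excerpt, so the vote aggregates the window-level decisions into one label.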


A tempting conclusion is that since the normalised classification
accuracies in Figs.|^b) and[^d) are extremely unlikely
to arise by chance (p < 10“®^ for GTZAN and p < 10“^®° for 
LMD by a Binomial test) it is therefore entirely reasonable to 
reject the hypothesis that our (C)DNN are choosing outputs at 
random. Hence, one might argue that these (C)DNN must have 
learned features that are “relevant” to music genre recognition 
| |28t , pTI , ig. This argument appears throughout the MIR 
research discipline |jg, and turns on the strong assumption 
that there are only two ways a system can reproduce the 
ground truth of a dataset: by chance or by learning to solve
a specific problem thought to be well-posed by a cleanly 
labeled dataset m- In fact, there is a third way a system can 
reproduce the ground truth of a music dataset: by learning to 
exploit characteristics shared between the training and testing 
datasets that arise not from a relationship in the real world, 
but from the curation and partitioning of a dataset in the 
experimental design of an evaluation pS] , p9) , | |g . Since 
the evaluations producing Figs. and [3|as well as all results 
in ig, g, not to mention a significant number of published 
studies in MIR | |49| , do not control for this third way, we 
cannot validly conclude upon the “relevance” of whatever has 
been learned by these music content analysis systems. 
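The kind of significance figure quoted in this argument can be reproduced with a one-sided Binomial tail computed in log space, which avoids underflow for very small probabilities. The sketch below is illustrative: the function name is hypothetical, and the counts passed to it would be the number of correct classifications, the number of test excerpts, and the chance rate for the dataset at hand, none of which are taken from the results above.

```python
import math

def log10_binom_tail(k, n, p):
    """log10 of P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    # Log-sum-exp over the tail terms for numerical stability.
    m = max(logs)
    total = m + math.log(sum(math.exp(v - m) for v in logs))
    return total / math.log(10)
```

A very negative value justifies rejecting only the hypothesis of random guessing; it says nothing about which of the remaining explanations for the performance is correct.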

A notion of this problem is given by the significant decreases
in the FoM we measure when partitioning GTZAN and
LMD along artist lines. By doing so, we are controlling for 
some independent variables that a system might be exploiting 
to reproduce ground truth, but which arguably have little 
relevance to the high-level labels of the dataset |jg. More 
concretely, consider that all 100 excerpts labeled Pop in 
GTZAN come from recordings of music by four artists, 25 
from each artist. If we train and test a system on a random 
partition of GTZAN, we cannot know whether the system is 
recognising Pop, recognising the artist, or recognising other 
aspects that may or may not be related to Pop. If we train a 
system instead with Pop excerpts by three artists, and test with the
Pop excerpts by the fourth artist, then we might be testing
something closer to Pop recognition. This all depends on
defining what knowledge is relevant to the problem.

A common retort to these arguments is that a system should 
be able to reproduce ground truth “by any means.” One thereby 
defines “relevant knowledge” as any correlation that helps a
system reproduce an amount of dataset ground truth that is 
inconsistent with chance. However, this can lead to circular 
reasoning: system X has learned “relevant knowledge” because 
it reproduces Y amount of ground truth; system X reproduces 
Y amount of ground truth because it has learned “relevant 
knowledge.” It is also deaf to one of the major aims of research 
in music content analysis 0: “to make music, or information 
about music, easier to find.” If a music content analysis system
is describing music in ways that do not align with those of 
its users, then its usability is in jeopardy no matter its FoM 
in benchmark datasets p2) , | [5^ . Finally, this means that the 
problem thought to be well-posed by a cleanly labeled dataset 
can be many things simultaneously — which leads to the 
problem of how to validly compare apples and oranges 0. 
In other words, why compare systems when they are solving 
different problems? This also applies to the comparisons above 
with the FoM in Fig. 

While we have no idea whether our (C)DNN systems in
Fig. 3 are exploiting “irrelevant” characteristics in LMD, our
experimental results with adversaries in Figs. 4 and 5 and
Tables II and III indicate that their decision machinery is
incredibly sensitive in very strange ways. Our adversaries are
able to fool the high-performing deep learning systems by
perturbing their input in minor ways. Auditioning the results
in Table II shows that while the music in each recording
remains exactly the same, and the perturbations are very small,
the DNN is nearly always fooled into choosing with high
confidence every class it has supposedly learned. The CDNN
is similarly defeated by our adversary; however, it is quite
notable that it requires perturbations of far lower SNR than
does the DNN. We are currently studying the reasons for this.

Our application of adversaries here is close to the “method 
of irrelevant transformations” that we apply in pS) , | |52) , 
| [5^ to assess the internal models of music content analysis 
systems, and to test the hypothesis, “the system is using 
relevant criteria to make its decisions.” In [48], we take a
brute force approach whereby we apply random but linear
time-invariant and minor filtering to inputs of systems trained
in three different music recording datasets until their FoM
becomes perfect or random. We also make each system apply
every one of its classes to the same music recordings in Table II.
In 1^, we instead apply subtle pitch-preserving time- 




stretching of music recordings to fool a deep learning system 
trained in the benchmark music dataset BALLROOM | |^ . 
We find that through such a transformation we can make the
system perform perfectly or no better than random by applying
tempo changes of at most 6% to test dataset recordings. We
find a similar result for the same kind of deep learning system


These results can be auditioned here: http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html


but trained in LMD [52].

Our adversary in Alg. 1 moves instead right to the Achilles
heel of a deep learning system, coaxing it to behave in arbitrary
ways for an input simply by making minor perturbations to
the sampled audio waveform that have no effect on the music
content it possesses. We observe in Fig. 5 and by auditioning
Table II that the low- to mid-frequency content of adversarial
examples differs very little from the original recordings, but
find more significant differences in the high-frequency spectra.
This suggests that the distribution of energy in the
high-frequency spectrum has significant impact on the decision
machinery of our (C)DNN. The apparent high relevance of
such slight characteristics in proportion to that of the actual
musical content of a music recording does not bode well
for one of the most important aims of machine learning:
generalisation.

As observed by Goodfellow et al. [24] in their deep learning
systems taught to recognise objects in images, the impressive 
FoM we measure of our deep learning systems may be merely 
a colourful “Potemkin village.” Employing an adversary to 
scratch a little below the surface reveals the FoM to be 
curiously hollow. A system that appears to be solving a 
complex problem but actually is not is what we term a “horse” 
ii, which is a nod to the famous horse Clever Hans: a real 
horse that appeared to be a capable mathematician but was 
merely responding to involuntary cues that went undetected 
because his public demonstrations had no validity to attest to 
such an ability. Measuring the number of correct answers Hans 
gives in an uncontrolled environment does not give reason 
to conclude he comprehends what he appears to be doing. 
It is the same with the experiments we perform above with 
systems labelling observations in GTZAN and LMD. In fact, 
Goodfellow et al. [24] come to the same conclusion: “The
existence of adversarial examples suggests that ... being able 
to correctly label the test data does not imply that our models 
truly understand the tasks we have asked them to perform” 
12^ . This observation is now well-known in MIR ez)-0, 
but deserves to be repeated. 

VI. Conclusion 

In this article, we have shown how to adapt the adversary
of Szegedy et al. [54] to work within the context of music
content analysis using deep learning. We have shown how our
adversary is effective at fooling deep learning systems of
different architectures, trained on different benchmark datasets.
We find our convolutional networks are more robust against
this adversary than our deep neural networks. We have also
sought to employ the adversary as part of the training of these
systems, but find it results in systems that remain as sensitive
to the same adversary.

It is of course not very popular for one to be an “adversary” 
to research, moving quickly to refute conclusions and break 
systems reported in the literature; however, we insist that 
breaking systems leads ultimately to progress. Considerable 
insight can be gained by looking behind the veil of performance
metrics in an attempt to determine the mechanisms
by which a system operates, and whether the evaluation is 






any valid reflection of the qualities we wish to measure. Such 
probing is necessary if we are truly interested in ascertaining 
what a system has learned to do, what its vulnerabilities might 
be, how it compares to competing systems supposedly solving 
the same problem, and how well we can expect it to perform 
when used in real-world applications. 